igraph package.This tutorial/primer is inspired and based on material presented in this blog post, this tutorial, this RPub, the igraph documentation, the tidygraph documentation, and the ggraph documentation.
igraphigraph is a large library of graph theory related functions which is available for C, python, and R. We will be using the R package, but much of what we will cover is general in nature. Additionally, most packages that interface with igraph have joint documentation – you’ll need to understand igraph in order to use tidygraph.
igraph began development well before the tidyverse, so it does not natively support a lot of the grammar we might be used to. The packages tidygraph and ggraph fill-in this missing functionality.
library(magrittr)
library(viridis) # good continuous color palette
library(tidyverse)
library(igraph)
library(ggraph)
library(tidygraph)
A graph is a set of objects and in which some pairs of the objects are “related.” The objects are called nodes (or vertices) and pairs of nodes are related by edges (or links). These edges can be either directed or undirected, and they can be weighted or unweighted. While there are many kinds of graphs, all are representations of relational data.
A typical graph can be expressed as an ordered pair of vectors \(G = (V, E)\), with \(V\) being the set of nodes and \(E\) being set of edges, each of which is defined by the related pair of nodes. The order of a graph is the number of nodes in the graph, \(|V|\). The size of a graph is the number of edges in the graph, \(|E|\).
We’re going to start by making our graphs by hand. This activity will help us understand what is happening when we induce graphs from our actual data.
Let’s start with the graph defined \[ \begin{align} V &= \{1, 2, 3, 4\} \\ E &= \{ \{1, 2\}, \{2, 3\}, \{2, 4\}, \{4, 1\} \}. \\ \end{align} \] Try drawing this graph out by hand first.
Now let’s make a graph using the edge list.
edge_list <- tibble(from = c(1, 2, 2, 4),
to = c(2, 3, 4, 1)) # define the relationships
node_list <- tibble(id = 1:4) # name the nodes
g1 <- tbl_graph(nodes = node_list,
edges = edge_list,
directed = FALSE) %T>%
print() # what does this object *look* like?
## # A tbl_graph: 4 nodes and 4 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 4 x 1 (active)
## id
## <int>
## 1 1
## 2 2
## 3 3
## 4 4
## #
## # Edge Data: 4 x 2
## from to
## <int> <int>
## 1 1 2
## 2 2 3
## 3 2 4
## # ... with 1 more row
ggraph(g1) +
geom_edge_link() +
geom_node_point(size = 5) +
geom_node_text(aes(label = id), repel = TRUE) +
theme_graph()
## Using `nicely` as default layout
Edge lists are not the only way of representing relational data. Another common format is the “adjacency matrix.” This is a square matrix of nodes by nodes, with the corresponding element being 1 if they share an edge or 0 if not. You might be familiar with these from ecology where they are called co-occurrence matrices, and the induced graph called a co-occurrence network.
Here is the basic strategy for making a graph object from an adjacency matrix. If the adjacency matrix is symmetrical then the induced graph is undirected. If the square adjacency matrix is not symmetrical, then it could be a directed graph.
# random adjacency matrix
adjm <- matrix(c(0, 1, 0, 1,
1, 0, 1, 1,
0, 1, 0, 0,
1, 1, 0, 0), # symmetrical
ncol = 4)
as_tbl_graph(adjm, directed = FALSE) %>%
ggraph(.) +
geom_edge_link() +
geom_node_point(size = 5) +
theme_graph()
## Using `nicely` as default layout
Sometimes a graph can have weighted edges, where some connections are considered greater than others. For example, if nodes represent communities than the weight of the edge can represent the number of shared species between those communities.
Here is an example of making a graph from an edge list with weights.
edge_list <- tibble(from = c(1, 2, 2, 4),
to = c(2, 3, 4, 1),
weight = c(1, 2, 2, 1)) # define the relationships
node_list <- tibble(id = 1:4) # name the nodes
g2 <- tbl_graph(nodes = node_list,
edges = edge_list,
directed = FALSE)
ggraph(g2) +
geom_edge_link(aes(width = weight)) +
geom_node_point(size = 5) +
theme_graph()
## Using `nicely` as default layout
When making a graph from an adjacency matrix, the elements can be used to represent the weights.
# random adjacency matrix
adjm2 <- matrix(c(0, 1, 0, 1,
1, 0, 2, 2,
0, 2, 0, 0,
1, 2, 0, 0), # symmetrical
ncol = 4)
as_tbl_graph(adjm2, directed = FALSE) %>%
ggraph(.) +
geom_edge_link(aes(width = weight)) +
geom_node_point(size = 5) +
theme_graph()
## Using `nicely` as default layout
Additionally, some graphs are considered directed. Edges in a directed graph have a direction linking one node to another, but the reverse is not guaranteed. Twitter follows are directed edges – you follow someone, but they don’t necessarily follow you. Our previous graphs were all undirected, which is more akin to a Facebook friend – your friendship is shared and not one-sided.
Let’s make an example directed graph.
# random adjacency matrix
adjm_dir <- matrix(c(0, 0, 0, 0,
1, 0, 0, 0,
0, 1, 0, 0,
1, 1, 0, 0), # asymmetrical
ncol = 4)
as_tbl_graph(adjm_dir, directed = TRUE) %>%
ggraph(.) +
geom_edge_link(arrow = arrow(),
start_cap = circle(3, 'mm'),
end_cap = circle(3, 'mm')) +
geom_node_point(size = 5) +
theme_graph()
## Using `nicely` as default layout
Bipartite graphs are a special type of graph that is particularly relevant to (paleo)biological analysis. In a bipartite graph the nodes are divided into two disjoint and independent sets, \(U\) and \(V\). Disjoint and independent means that nodes within each set do not share any connections, but that there are only connections between nodes of different sets. We amend our earlier notation for a graph to \(G = (U, V, E)\) to denote these two distinct sets of nodes.
Here is an example bipartite graph \[ \begin{align} U &= \{1, 2, 3\} \\ V &= \{3, 5\} \\ E &= \{ \{1, 4\}, \{1, 5\}, \{2, 5\}, \{3, 5\} \}. \\ \end{align} \] Try drawing this graph out by hand first.
g3 <- make_bipartite_graph(types = c(0, 0, 0, 1, 1), # node membership
edges = c(1, 4,
1, 5,
2, 5,
3, 5)) %>% # each line is an edge
as_tbl_graph() %T>%
print() # what does this object *look* like?
## # A tbl_graph: 5 nodes and 4 edges
## #
## # A bipartite simple graph with 1 component
## #
## # Node Data: 5 x 1 (active)
## type
## <lgl>
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 TRUE
## 5 TRUE
## #
## # Edge Data: 4 x 2
## from to
## <int> <int>
## 1 1 4
## 2 1 5
## 3 2 5
## # ... with 1 more row
ggraph(g3, layout = 'bipartite') +
geom_edge_link() +
geom_node_point(aes(colour = type),
size = 5) +
theme_graph() +
scale_colour_manual(values = c('goldenrod', 'skyblue'))
A bipartite network can also be called two-mode graph because it can be projected into two one-mode networks. Each of the one-mode networks summarize only one set of nodes by compressing the second set of nodes into the edges of the first. This is intuitive to visualize.
First, draw out what you think the two one-mode projections look like. Then compare with this code snippet.
g3_proj <- bipartite_projection(g3) %>%
map(., ~ as_tbl_graph(.x, directed = FALSE)) %T>%
print()
## $proj1
## # A tbl_graph: 3 nodes and 3 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 3 x 0 (active)
## #
## # Edge Data: 3 x 3
## from to weight
## <int> <int> <dbl>
## 1 1 2 1
## 2 1 3 1
## 3 2 3 1
##
## $proj2
## # A tbl_graph: 2 nodes and 1 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 2 x 0 (active)
## #
## # Edge Data: 1 x 3
## from to weight
## <int> <int> <dbl>
## 1 1 2 1
# first projection
ggraph(g3_proj$proj1) +
geom_edge_link() +
geom_node_point(colour = 'goldenrod',
size = 5) +
theme_graph()
## Using `nicely` as default layout
# second projection
ggraph(g3_proj$proj2) +
geom_edge_link() +
geom_node_point(colour = 'skyblue',
size = 5) +
theme_graph()
## Using `nicely` as default layout
Now that we’ve covered a lot of the basics behind graphs and how to make them, let’s start apply this knowledge to some fossil occurrence information.
A relatively recent innovation in paleobiology is the application of networks to understanding fossil occurrences in space and time. These networks appear either as co-occurrence networks or bipartite taxon-locality networks. We’re going to cover how to make both. We’re going to start by making a bipartite network and then use the one-mode projections to give us co-occurrence networks.
For this exercise we will be using the Quaternary record of Canidae. Our data will be sourced directly from the Paleobiology Database. I’m using the following automatic filtering criteria: taxonomic name Canidae, interval Quaternary, identity resolved at least to genus (lumping genus and species as a single occurrence), only valid taxonomic names, and will all metadata. I’ve formulated this as a url so we can directly query the PBDB from our code. For more information how to to format this url call, check out the API documentation.
url <- 'https://paleobiodb.org/data1.2/occs/list.txt?base_name=Canidae&interval=Quaternary&idreso=lump_genus&taxon_status=valid&show=full'
canidae <- read_csv(file = url) %>%
filter(!is.na(formation), # need formation information
!str_detect(formation, '[:punct:]')) %>% # get rid of ambiguous entries
mutate(formation = str_to_title(formation)) # consistency
## Warning: Duplicated column names deduplicated: 'cc' => 'cc_1' [43]
# make a taxon-locality graph
canidae_graph <- canidae %>%
dplyr::select(accepted_name, formation) %>%
distinct() %>% # only unique genus-formation pairs
graph.data.frame(., directed = FALSE) %>%
as_tbl_graph(.) %>% # make it nice
activate(nodes) %>% # focus on node properties
mutate(type = bipartite_mapping(.)$type, # place in a partition
type_name = case_when(type == TRUE ~ 'Formation',
type == FALSE ~ 'Genus')) %T>%
print() # look at this graph object
## # A tbl_graph: 71 nodes and 89 edges
## #
## # A bipartite simple graph with 3 components
## #
## # Node Data: 71 x 3 (active)
## name type type_name
## <chr> <lgl> <chr>
## 1 Canis FALSE Genus
## 2 Vulpes FALSE Genus
## 3 Urocyon FALSE Genus
## 4 Xenocyon FALSE Genus
## 5 Alopex FALSE Genus
## 6 Lycaon FALSE Genus
## # ... with 65 more rows
## #
## # Edge Data: 89 x 2
## from to
## <int> <int>
## 1 1 14
## 2 1 15
## 3 2 15
## # ... with 86 more rows
# visualize the bipartite graph
ggraph(canidae_graph, layout = 'bipartite') + # there is no *best* layout
geom_edge_link() +
geom_node_point(aes(colour = type_name),
size = 5) +
theme_graph() +
scale_colour_manual(values = c('goldenrod', 'skyblue'))
This network that we’ve just made and visualized is a biogeographic occurrence network which Canidae genera appear in various geological formations.
As demonstrated earlier, the bipartite network encodes a lot of information including both one-mode projections which describe genus-genus co-occurrence where edges represent occuring in the same formation, and formation-formation co-occurrence where edges represent genera occurring in both formations.
When we make these one-mode projections edge weights are calculated as the edge multiplicity. This means that, for example, if Location A and Location B share Species 1 and Species 2, the edge connecting Locations A and B has weight 2.
# decompose into the one-mode networks for co-occurrence
canidae_graph_proj <- bipartite_projection(canidae_graph) %>%
map(., ~ as_tbl_graph(.x, directed = FALSE)) %T>%
print() # look at this new object
## $proj1
## # A tbl_graph: 13 nodes and 18 edges
## #
## # An undirected simple graph with 3 components
## #
## # Node Data: 13 x 2 (active)
## name type_name
## <chr> <chr>
## 1 Canis Genus
## 2 Vulpes Genus
## 3 Urocyon Genus
## 4 Xenocyon Genus
## 5 Alopex Genus
## 6 Lycaon Genus
## # ... with 7 more rows
## #
## # Edge Data: 18 x 3
## from to weight
## <int> <int> <dbl>
## 1 1 2 11
## 2 1 3 6
## 3 1 4 1
## # ... with 15 more rows
##
## $proj2
## # A tbl_graph: 58 nodes and 1177 edges
## #
## # An undirected simple graph with 3 components
## #
## # Node Data: 58 x 2 (active)
## name type_name
## <chr> <chr>
## 1 Palm Spring Formation
## 2 Kingsdown Formation
## 3 Manix Formation
## 4 Cape Deceit Formation
## 5 Tacubaya Formation
## 6 Bermont Formation
## # ... with 52 more rows
## #
## # Edge Data: 1,177 x 3
## from to weight
## <int> <int> <dbl>
## 1 1 2 1
## 2 1 3 1
## 3 1 4 1
## # ... with 1,174 more rows
We can then visualize each of the one-mode networks.
# genus co-occurrence network
ggraph(canidae_graph_proj$proj1) +
geom_edge_link() +
geom_node_point(size = 5,
colour = 'skyblue') +
geom_node_text(aes(label = name),
repel = TRUE) +
theme_graph()
## Using `nicely` as default layout
# formation co-occurrence network
ggraph(canidae_graph_proj$proj2) +
geom_edge_link() +
geom_node_point(size = 5,
colour = 'goldenrod') +
geom_node_text(aes(label = name),
repel = TRUE) +
theme_graph()
## Using `nicely` as default layout
Now that we’ve made our graph, what do we do with it? What properties of our graph might be interesting?
There are a lot of ways to summarize properties of our graph. I’m presenting a few here, but these are by no means exhaustive. Also, a lot of these measures are not explicitly defined for bipartite graphs, so while me might get values out the other end they might not mean exactly what we had in mind.
edge_density(canidae_graph_proj$proj1)
## [1] 0.2307692
# diameter length
diameter(canidae_graph_proj$proj1)
## [1] 9
# diameter path
get_diameter(canidae_graph_proj$proj1)
## + 5/13 vertices, named, from 53b71df:
## [1] Urocyon Canis Chrysocyon Protocyon Cerdocyon
dd <- degree_distribution(canidae_graph_proj$proj1)
ggplot(data = tibble(x = seq(from = 0,
to = length(dd) - 1),
y = dd),
mapping = aes(x = x, y = y)) +
geom_point() +
geom_line() +
labs(x = 'Degree',
y = 'Percentage of nodes with that degree') +
scale_x_continuous(breaks = seq(from = 0,
to = length(dd) - 1,
by = 2)) +
theme_minimal()
# Biogeographic connectedness
biogeo_connect <- function(g) {
if(!is_bipartite(g)) {
return(paste('graph object must be bipartite.'))
}
oo <- gsize(g)
bp <- bipartite_mapping(g)$type
nn <- sum(bp == FALSE) # assume the false nodes are taxa
ll <- sum(bp == TRUE) # assume the true nodes are localities
bc <- (oo - nn) / (ll * nn - nn)
bc
}
biogeo_connect(canidae_graph)
## [1] 0.1025641
cluster_infomap(canidae_graph)$codelength # summary stat
## [1] 4.599039
The major category of summary statistics are centrality measures – how close is each node to the “center” of the graph? There are many different kinds of centrality measures, each with their own logic. A famous example is the PageRank algorithm which is the core of the Google search engine. All centrality measures are properties of the nodes. Here is a very brief introduction to five different centrality measures.
Degree centrality is the number of edges associated with that node.
Closeness is the inverse of the average length of the shortest path between that node and all other nodes.
Eigenvector centrality scores correspond to the values of the first eigenvector of the graph adjacency matrix. This measure can be interpreted reciprocally as the centrality of each node is proportional to the sum of the centralities of those nodes to whom he or she is connected. If you keep working with graphs, you will learn that the spectral properties of a graph (e.g. eigenvectors) have a lot of important properties. For example, the leading eigenvalue of a predator-prey network is a measure of community stability.
Betweenness is roughly defined as the number of shortest paths that path through that node – how central is that node for getting from one node to another.
PageRank follows the logic that important nodes are connected to other important nodes. This is the algorithm at the core of Google’s search engine – important websites frequently link to and are linked by other important websites. Conceptually, this algorithm envisions a random web surfer, which moves around the network but occasionally teleports (e.g. hits the home button) and starts walking from a completely different part of the network.
There are many more – tidygraph uses the common predicate centrality_*, so I encourage you to check out some of the other options.
Most centrality measures are not properly defined for bipartite networks as they assume all nodes belong to the same set. This does not mean these function calls will fail, just that the answer might not be what we expect.
Let’s apply these centrality measures to one-way projections. centrality_* functions help us add features to our graphs which we can use to help visualize important parts of our graphs.
canidae_graph_proj$proj1 %>%
activate(nodes) %>% # focus on the nodes
mutate(degree = centrality_degree(),
pagerank = centrality_pagerank()) %>%
ggraph() +
geom_edge_link() +
geom_node_point(aes(size = degree,
colour = pagerank)) +
theme_graph() +
scale_colour_viridis()
## Using `nicely` as default layout
canidae_graph_proj$proj2 %>%
activate(nodes) %>% # focus on the nodes
mutate(degree = centrality_degree(),
pagerank = centrality_pagerank()) %>%
ggraph() +
geom_edge_link() +
geom_node_point(aes(size = degree,
colour = pagerank)) +
theme_graph() +
scale_colour_viridis()
## Using `nicely` as default layout
If we are just interested in the network summaries, we need to use the basic igraph functions – tidygraph objects are not compatible with dplyr::summary(...).
degree(canidae_graph_proj$proj1)
page_rank(canidae_graph_proj$proj1)
degree(canidae_graph_proj$proj2)
page_rank(canidae_graph_proj$proj2)
A common desire in paleobiological analysis of graphs is estimating the presence of distinct communities – two or more groups of nodes that are more highly connected within each group than between the groups. For example, Muscente et al. PNAS 2018 is mostly just community detection applied to a taxon-taxon co-occurrence graph.
There are many community detection algorithms which take many different approaches to breaking up the network. Community detection algorithms are also complicated and highly varied. Some algoritms take into account edge weights or directed edges, but many do not. Always read the documentation! It is your job to make sure a method is appropriate for your network.
Here is a brief introduction to a few commonly used techniques.
Clustering algorithms like the well-known standard hierarchical clustering can be applied to networks.
Modularity based methods. Modularity is the “strength” of division in a network. The modularity of a network is the fraction of edges that fall within a group minus the expected fraction if edges were assigned randomly. High modularity means there dense connections between the nodes within modules but sparse connections between nodes. Communities are made by this cuts which maximize the modularity of the network, which then partitions the network into distinct subnetworks. There are a few different optimization algorithms that try to maximize the network’s modularity: fast greedy optimization, full optimization (slow), and hybrid methods like Louvian (popular). Overall, these methods choke on larger networks and suffer from resolution limits – maximizing modularity is potentially slow while not necessarily being a good approach.
Label Propagation labels each node with unique labels and then updating the labels by majority voting in the neighborhood of the node. Densely connected nodes will form a consensus label and thus represent a community.
Walktrap is similar to InfoMap in that it envisions a random walker on the graph. Short random walks tend to stay in the same community of a graph.
InfoMap is like Walktrap plus PageRank. I briefly introduced this algorithm when discussing the code length total graph summary statistic, so this is moslty a repeat of that earlier information.
This algorithm imagines a random walk on the graph that can teleportation at random intervals (the same thing as the “random web surfer” from PageRank.) This algorithm seeks to minimize the expected descriptive length of that walk – by identifying communities, describing the graph as a binary code takes fewer bits. Also, this is the only algorithm discussed here that is defined to work on bipartite graphs (this is not well known).
This algorithm also returns the number of bits necessary to describe the network, a criminally underused value for comparing networks in paleobiology – see Sidor et al PNAS 2013 for an example usage. The shorter the code length, the simpler the graph.
I encourage you check out http://www.mapequation.org/ for more information.
There are many more – tidygraph uses the common predicate group_*, so I encourage you to check out some of the other options. In general, I use the InfoMap algorithm – it is defined by biogeographic networks and has better properties and performance compared to nearly every other algorithm. As with the centrality_* family of functions, the group_* family also gives use features for each of the nodes. These features can help us color or shape our nodes by their community memberships. Additionally, the cluster_* family of functions in the igraph library are useful when we want to observe network summaries directly.
Let’s see how many communities are identified in our Canidae data. We have three graphs we could analyze: the bipartite genus-formation graph, the one-mode genus-genus co-occurrence graph, and the one-mode formation-formation co-occurrence graph. Community detection on the bipartite graph will identify groups genus-formation pairs that commonly occur, while community detection on the one-way graphs will only identify genera or formations with similar occurrence patterns. I’m only going to be using the InfoMap algorithm, but feel free to try others.
graph_code <- cluster_infomap(canidae_graph)$codelength # summary stat
# community detection on bipartite genus-formation network
canidae_graph %>%
activate(nodes) %>% # node properties
mutate(group = as.factor(group_infomap())) %>%
ggraph(layout = 'bipartite') +
geom_edge_link() +
geom_node_point(aes(colour = group),
size = 5) +
geom_node_text(aes(label = name),
repel = TRUE) +
theme_graph() +
scale_colour_discrete(name = 'Membership') +
labs(subtitle = paste0('Code length = ', round(graph_code, digits = 2)))
# community detection on genus-genus co-occurrence network
genus_code <- cluster_infomap(canidae_graph_proj$proj1)$codelength
canidae_graph_proj$proj1 %>%
activate(nodes) %>% # node properties
mutate(group = as.factor(group_infomap())) %>%
ggraph() +
geom_edge_link() +
geom_node_point(aes(colour = group),
size = 5) +
geom_node_text(aes(label = name),
repel = TRUE) +
theme_graph() +
scale_colour_discrete(name = 'Membership') +
labs(subtitle = paste0('Code length = ', round(genus_code, digits = 2)))
## Using `nicely` as default layout
# community detection on formation-formation co-occurrence network
formation_code <- cluster_infomap(canidae_graph_proj$proj2)$codelength
canidae_graph_proj$proj2 %>%
activate(nodes) %>% # node properties
mutate(group = as.factor(group_infomap())) %>%
ggraph() +
geom_edge_link() +
geom_node_point(aes(colour = group),
size = 5) +
geom_node_text(aes(label = name),
repel = TRUE) +
theme_graph() +
scale_colour_discrete(name = 'Membership') +
labs(subtitle = paste0('Code length = ', round(formation_code, digits = 2)))
## Using `nicely` as default layout
In this lesson we covered the basic definition of a graph/network as mathematical construct made of nodes and edges. This definition includes concepts like directed versus undirected edges and weighted edges. The basics of creating graphs from edge lists or adjacency matrices was covered as a prelude to processing PBDB data into a biogeographical network. We also introduced a special type of network called a bipartite network. We also were introduced to a variety of ways and perspectives of summarizing a network – these included whole graph summaries, node and centrality properties, and community detection.
Sometimes we want to generate a random graph instead of writing one out by hand – it certainly makes writing examples easier! There are many ways to generate random graphs. tidygraph defines four types of “games,” or ways to generate random graphs: component, evolution, sampling, and type. Here we will briefly discuss one example of each of these game types. None of these approaches are necessarily better than the others, but each can be very useful for many reasons. After all, truth only exists in simulation.
In tidygraph all of the functions for generating random graphs have the predicate play_*. I’m only introducing one example from each game, but you are encouraged to explore what other options exist.
Let’s start with the classic Erdos-Renyi graph model. This model has its own classic notation that breaks some of earlier statements. The Erdos-Renyi graph model is defined two ways: a random graph defined be the number of nodes \(n\) and the probability of an edge occurring \(p\), \(G(n, p)\); or a random graph defined by the number of nodes \(n\) and the number of edges \(m\), \(G(n, m)\).
A major statistical property of the Erdos-Renyi model is that the generated graph should have a Poisson degree distribution for large \(n\). It also turns out that this is a terrible model of real world networks because degree distribution real world networks are believed to be much more heavily tailed than the Poisson distribution..
play_erdos_renyi(n = 100, # nodes
p = 0.1) %>% # probability of edge occurring
ggraph(.) +
geom_edge_link() +
geom_node_point(size = 3,
colour = 'blue') +
theme_graph()
## Using `nicely` as default layout
The Barabasi-Albert model is slightly more realistic model or real world networks than the Erdos-Renyi model. Conceptually, the Barabasi-Albert model incorporates the concepts of growth and preferential attachment.
Growth means that the number of nodes in the network increases over time. This means that the network begins with a base number of connected nodes and then adds more nodes and edges one node at a time.
Preferential attachment means that the more edges a node has, the more likely it is to gain more edges. As the network grows, a node with a high degree has a stronger ability to add edges to the network than a node with few edges.
The Barabasi-Albert model generates scale-free networks, meaning that the degree distribution follows a power law. In the Barabasi-Albert model’s case, the fraction \(P(k)\) of nodes in the network having \(k\) edges is \[ P(k) \sim k^{-3}. \]
play_barabasi_albert(n = 100, # number of nodes
power = 1, # attachment power (default = 1)
growth = 1) %>% # default = 1
ggraph(.) +
geom_edge_link() +
geom_node_point(size = 3,
colour = 'blue') +
theme_graph()
## Using `nicely` as default layout
We’ve already been introduced to bipartite graphs above when we wrote one out by hand. That’s very tedious. Here’s a function for generating a random bipartite network. You have to define the order of each set of nodes and the probability that a node from each set have an edge. Edges are assigned in the exact same manner as the Erdos-Renyi graph model.
play_bipartite(n1 = 30, # nodes in "top" part
n2 = 70, # nodes in "bottom" part
p = 0.1) %>% # probability of edge
ggraph(., layout = 'bipartite') +
geom_edge_link() +
geom_node_point(aes(colour = type),
size = 3) +
theme_graph() +
scale_colour_manual(values = c('goldenrod', 'skyblue'))
Sometimes we want to generate one-way graphs with defined “communities,” or two or more sections of the graph which are more connected within that section than between the sections. These type of random graphs are very useful for testing community detection algorithms. This type of graph is conceptually and fundamentally very different from a bipartite network where the nodes in a set cannot be connected to any other nodes in the same set. Additionally, the community memberships are not returned.
The stochastic block model is the simplest “component game.” First, we need to define how many total nodes there are. Second, the block memberships of the nodes are defined. Finally, we define a preference matrix which describes the probability of connections within and between the different components. Edges are assigned in the same manner as the Erdos-Renyi graph model using the approach probability as defined in the preference matrix.
block_size <- c(30, 70)
pref_matrix <- matrix(c(0.1, 0.005,
0.001, 0.05), nrow = 2)
play_blocks(n = 100, # how many nodes
size_blocks = block_size, # size of each block
p_between = pref_matrix) %>% # connectedness probability matrix
ggraph(.) +
geom_edge_link() +
geom_node_point(size = 3,
colour = 'blue') +
theme_graph()
## Using `nicely` as default layout